In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice (version: 3.13.0)You can find help files for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file EP16dat1.RData. You can download it here.
To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.
If you know the path to the file, you can also use load("<path>/EP16dat1.RData").
If you have not followed the first practical or if you re-loaded the EP16dat1 data, you need to re-code the variable educ again:
EP16dat1$educ <- as.ordered(EP16dat1$educ)The imputed data are stored in a mids object called imp that we created in the previous practical.
You can load it into your workspace by clicking the object imps.RData if you are using RStudio. Alternatively, you can load this workspace using load("<path>/imps.RData").
mids objectsWhen we are confident that the imputation was successful, the imputed data can be analysed with any standard complete data method.
Depending on the research question, for the NHANES data you might use
lm()glm() with family = binomial()glm() with an appropriate familySince our imputed data imp is not a data.frame but a mids object, we cannot use the standard specification lm(formula = ..., data = imp, ...) but need to use the function with() from the mice package:
with(imp, lm(formula, ...))
class and structure of the resulting object.For example:
mod1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
mod2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI, family = "binomial"))
mod3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI, family = Gamma(link = "log")))
# You may, for example, choose one of these models:
mod1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
mod2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI,
family = "binomial"))
mod3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI,
family = Gamma(link = 'log')))# to determine the class
class(mod1)## [1] "mira" "matrix"
# to determine the structure
names(mod1)## [1] "call" "call1" "nmis" "analyses"
sapply(mod1, class)## call call1 nmis analyses
## "call" "call" "integer" "list"
# explore further
mod1$analyses## [[1]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 91.8201 0.3961 -6.0867 -1.7343 1.0969 0.3472
##
##
## [[2]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 95.7341 0.3935 -5.7506 -1.3572 0.3308 0.3300
##
##
## [[3]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 94.1262 0.3836 -5.7174 -2.5476 1.1176 0.3040
##
##
## [[4]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 93.7607 0.4083 -5.8519 -1.0055 0.6318 0.3152
##
##
## [[5]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 94.7909 0.3863 -5.6186 -0.2588 0.5052 0.3196
summary(mod1$analyses[[1]])##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.069 -9.164 -0.947 7.956 67.742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.82008 3.54252 25.919 < 2e-16 ***
## age 0.39610 0.02987 13.261 < 2e-16 ***
## genderfemale -6.08667 0.98134 -6.202 8.15e-10 ***
## bili -1.73435 1.69586 -1.023 0.3067
## chol 1.09693 0.45654 2.403 0.0165 *
## BMI 0.34717 0.07880 4.406 1.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.77 on 994 degrees of freedom
## Multiple R-squared: 0.2227, Adjusted R-squared: 0.2188
## F-statistic: 56.97 on 5 and 994 DF, p-value: < 2.2e-16
mira objectsAnalyses performed on mids objects will result in a mira object (multiply imputed repeated analyses).
mira object is a list with the following elements:
call:
|
The call that created the mira object.
|
call1:
|
The call that created the mids object.
|
nmis:
|
The number of missing observations per column. |
analyses:
|
The models fitted on each of the imputed datasets: A list with imp$m components, where each component contains a fitted model object.
|
The class of each of the components of imp$analyses will depend on the type of model that was fitted.
If you want to check the results from one particular imputed dataset, e.g., the 3rd, you can access the fitted model via mod1$analyses[[3]]. Alternatively you can use the function getfit().
To pool the results from the repeated analyses contained in a mira object, the function pool() can be used.
Pool one of the provided mira objects (mod1, mod2 or mod3) and investigate its structure.
# pool the model
pooled1 <- pool(mod1)
# investigate the structure
class(pooled1)## [1] "mipo" "data.frame"
names(pooled1)## [1] "call" "m" "pooled" "glanced"
str(pooled1)## Classes 'mipo' and 'data.frame': 0 obs. of 4 variables:
## $ call : language pool(object = mod1)
## $ m : int 5
## $ pooled :'data.frame': 6 obs. of 11 variables:
## ..$ term : Factor w/ 6 levels "(Intercept)",..: 1 2 3 4 5 6
## ..$ m : int 5 5 5 5 5 5
## ..$ estimate: num 94.046 0.394 -5.805 -1.381 0.736 ...
## ..$ ubar : num 1.29e+01 9.02e-04 9.73e-01 2.87 2.11e-01 ...
## ..$ b : num 2.11 9.42e-05 3.17e-02 7.22e-01 1.26e-01 ...
## ..$ t : num 15.43697 0.00101 1.01112 3.73869 0.36258 ...
## ..$ dfcom : int 994 994 994 994 994 994
## ..$ df : num 125.9 236.2 713.2 67.9 22.1 ...
## ..$ riv : num 0.1964 0.1253 0.0391 0.3014 0.7158 ...
## ..$ lambda : num 0.1642 0.1114 0.0377 0.2316 0.4172 ...
## ..$ fmi : num 0.1771 0.1188 0.0404 0.2533 0.4636 ...
## $ glanced:'data.frame': 5 obs. of 12 variables:
## ..$ r.squared : num 0.223 0.206 0.207 0.217 0.199
## ..$ adj.r.squared: num 0.219 0.202 0.203 0.214 0.195
## ..$ sigma : num 14.8 14.9 14.9 14.8 14.9
## ..$ statistic : Named num 57 51.5 51.8 55.3 49.4
## .. ..- attr(*, "names")= chr [1:5] "value" "value" "value" "value" ...
## ..$ p.value : Named num 3.65e-52 1.37e-47 8.14e-48 9.90e-51 1.04e-45
## .. ..- attr(*, "names")= chr [1:5] "value" "value" "value" "value" ...
## ..$ df : Named num 5 5 5 5 5
## .. ..- attr(*, "names")= chr [1:5] "numdf" "numdf" "numdf" "numdf" ...
## ..$ logLik : num -4108 -4115 -4116 -4114 -4119
## ..$ AIC : num 8231 8244 8245 8241 8253
## ..$ BIC : num 8265 8279 8279 8276 8287
## ..$ deviance : num 216783 219748 219908 219047 221545
## ..$ df.residual : int 994 994 994 994 994
## ..$ nobs : int 1000 1000 1000 1000 1000
mipo objectsPooling of a mira object results in a mipo object (multiply imputed pooled analysis).
mipo object is a list that has the following elements:
call:
|
The call that created the mipo object.
|
m:
|
The number of imputed datasets. |
pooled:
|
A data.frame containing pooled estimates and variances.
|
pooled has the following columns:
estimate:
|
Pooled parameter estimates |
ubar:
|
The within-imputation variance of the estimate
|
b:
|
The between-imputation variance of the estimate
|
t:
|
The total variance of the estimate
|
dfcom:
|
Degrees of freedom in the complete data |
df:
|
Degrees of freedom associated with the \(t\)-statistics. |
riv:
|
The relative increase in variance due to the missing values: \((b+b/m)/ubar\) |
lambda:
|
Proportion of the variation due to the missing values: \((b+b/m)/t\). |
fmi:
|
Fraction of missing information. |
From the elements contained in the mipo object, we could calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.
However, the function summary() applied to the mipo object will do this for us. This function has four more arguments that may be relevant:
conf.int: needs to be set to TRUE in order to to get confidence intervalsconf.level: the confidence level, by default this is 0.95exponentiate: whether to exponentiate the pooled estimates and confidence intervals (e.g., for logistic regression)Get the summary of the pooled results.
pooled1 <- pool(mod1)
summary(pooled1, conf.int = TRUE)## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 94.0463858 3.92899092 23.9365241 125.89054 0.000000e+00 86.2709627 101.8218088
## 2 age 0.3935746 0.03185725 12.3543153 236.17521 0.000000e+00 0.3308139 0.4563352
## 3 genderfemale -5.8050262 1.00554606 -5.7730087 713.18805 1.161878e-08 -7.7792106 -3.8308418
## 4 bili -1.3806815 1.93356811 -0.7140589 67.91568 4.776377e-01 -5.2391433 2.4777804
## 5 chol 0.7364822 0.60214286 1.2231021 22.10613 2.341702e-01 -0.5119381 1.9849026
## 6 BMI 0.3231938 0.08187090 3.9476031 614.16507 8.806992e-05 0.1624130 0.4839747
Note: pool() extracts the model coefficients and variance-covariance matrices using tidy from the package broom. Hence, pooling using the pool() function from mice only works for models of classes for which a method tidy exists.
© Nicole Erler